Single Nucleotide Polymorphisms Caused by Assembly Errors

Authors

  • Jürgen Kleffe
  • Robert Weißmann
  • Florian F Schmitzberger
Abstract

We compare the results of three assembler programs, Celera, Phrap and Mira2, on the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons, we do not focus on speed of computation or numbers of assembled contigs but on how well the different sequence assemblies agree in content. Genome regions assembled consistently by all three programs are identified in order to estimate a lower bound on the number of erroneously identified single nucleotide polymorphisms (SNPs) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de novo assemblies, spanning only 34% (3.3 Mb) of the bacterial genome, with 175 of these regions (~1.5 Mb) including erroneous SNPs and insertions/deletions. Within these triplets this leads, on average, to one error per 7,155 base pairs. When the assembler Mira2 is replaced by the most recent version, Mira3, the latter number even drops to 5,923. Our results therefore suggest that a considerable number of erroneous SNPs may be present in current sequence data, and that mathematicians should urgently take up research on the numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are supplied as input. Such errors will severely hamper molecular diagnostics as well as efforts to relate genome variation to disease. This issue needs to be addressed urgently, as the field is moving fast into clinical applications.
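The "one error per N base pairs" figure in the abstract is a ratio of compared bases to discrepancies. The following is a minimal sketch of that arithmetic, assuming the three assemblies of a region have already been aligned to equal length. The function names and the toy sequences are hypothetical illustrations, not the authors' pipeline or data.

```python
def count_discrepancies(a, b, c):
    """Count alignment columns where three pre-aligned sequences do not all agree."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y, z in zip(a, b, c) if not (x == y == z))


def bases_per_error(total_bases, errors):
    """Average number of base pairs per discrepancy; None if there are no errors."""
    return total_bases / errors if errors else None


# Hypothetical toy triplet ('-' marks a gap); NOT data from the paper.
celera = "ACGTACGTAC"
phrap = "ACGTACGTAC"
mira = "ACGAACG-AC"

errors = count_discrepancies(celera, phrap, mira)  # one substitution, one gap
rate = bases_per_error(len(celera), errors)
print(f"{errors} discrepancies, one per {rate} aligned positions")
```

In this toy case a substitution and a gap are counted alike, which mirrors the abstract's grouping of SNPs with insertions/deletions. How the paper itself aligns and scores triplets is not stated on this page.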

Similar resources

Association study of two single nucleotide polymorphisms rs10757278 and rs1333049 with atherosclerosis, a case-control study from Iraq

Atherosclerosis is one of the most important coronary artery diseases (CAD), caused by lipid accumulation, hypertension, smoking, and many other environmental and genetic factors. Genetic variations in rs10757278 and rs1333049 have been reported to correlate with CAD. In the present study, 100 blood samples were collected (50 CAD patients and 50 appeared to be healthy con...

Association of -77T>C and Arg194trp polymorphisms of XRCC1 with risk of coronary artery diseases in Iranian population

Objective(s): Coronary artery disease (CAD) is the leading cause of death in both males and females worldwide. The main cause of CAD is atherosclerosis of the coronary arteries, which is mostly caused by genetic alteration. 50% of such cases occur in mitotic cells where single-strand breaks occur spontaneously or due to ionizing radiation. X-ray repair cross-complementing protein 1 (XRCC1) as a ...

In-silico study to identify the pathogenic single nucleotide polymorphisms in the coding region of CDKN2A gene

Background: CDKN2A, encoding two important tumor suppressor proteins p16 and p14, is a tumor suppressor gene. Mutations in this gene and subsequently the defect in p16 and p14 proteins lead to the downregulation of RB1/p53 and cancer malignancy. To identify the structural and functional effects of mutations, various powerful bioinformatics tools are available. The aim of this study is the ident...

Decision tree studies in estimating breast cancer risk using single nucleotide polymorphisms

Abstract Introduction: Decision trees are data mining tools for collecting, accurately predicting, and sifting information from massive amounts of data, and are widely used in computational biology and bioinformatics. In bioinformatics they can be used to predict diseases, including breast cancer. The use of genomic data including single nucleotide polymorphisms is a very important ...

Investigation of fimH Single Nucleotide Polymorphisms (C640T and T591A) in Uropathogenic E. coli Isolated from Patients with Urinary Tract Infections

Background: Urinary tract infections (UTIs) are one of the most frequent health problems, and uropathogenic Escherichia coli (UPEC) is the major pathogen causing UTIs. The severity of UTIs is determined by the expression of a large range of virulence factors. In this study, we evaluated the allelic frequency of the fimH gene in UPECs isolated from patients with UTIs. This study also aimed to determine the roles of C64...

Journal title:

Volume 3, Issue

Pages -

Publication date: 2010